Human learners use different strategies to transfer knowledge between tasks to achieve maximum efficiency. This works on the premise that the more related a new task is to previous experience, the easier it is to master. Transfer learning in natural language processing is closely analogous: it is the process of deriving knowledge from a source task and applying it to a different target domain. Transfer learning can be classified along three dimensions: the source and target setting of the selected task, the nature of the source and target domains, and the order in which the tasks are learned. Based on these dimensions, it falls into four broad categories:
Domain Adaptation: We learn a task from a source dataset and apply the best-performing model to a target dataset drawn from a different, but related, domain; the task stays the same while the data distribution changes.
Cross-lingual Learning: We learn from a source language and apply the best-performing model to a linguistically close target language.
Multitask Learning: We learn multiple related tasks at the same time. This helps exploit the commonalities and differences across tasks, enabling the model to learn a better shared representation of the overall problem (a brief sketch of this idea follows the list).
Sequential Transfer Learning: We learn multiple tasks one after another, where the source and target tasks may be quite different from each other. In this form of transfer learning, however, it is essential that tasks immediately following each other are similar enough to help the model learn a better representation of the target task.
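To make the multitask idea concrete, here is a minimal PyTorch sketch in which two classification heads share a single encoder; the two tasks, vocabulary size, and layer dimensions are purely illustrative assumptions, not part of any specific published model:

```python
import torch
import torch.nn as nn

# Hypothetical multitask model: one shared encoder, one head per task.
class MultitaskModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256,
                 num_sentiment_classes=2, num_topic_classes=5):
        super().__init__()
        # Shared layers: learn a representation useful for every task.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Task-specific heads: capture the differences between tasks.
        self.sentiment_head = nn.Linear(hidden_dim, num_sentiment_classes)
        self.topic_head = nn.Linear(hidden_dim, num_topic_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.encoder(embedded)
        shared = hidden[-1]  # shared sentence representation
        return self.sentiment_head(shared), self.topic_head(shared)

model = MultitaskModel()
batch = torch.randint(0, 10000, (4, 20))  # 4 dummy sequences of 20 token ids
sentiment_logits, topic_logits = model(batch)
# During training, the losses of both heads are combined (optionally weighted)
# and backpropagated through the shared encoder.
```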
Among the various approaches to implementing transfer learning, the following three are the most commonly used:
Develop Model Approach: In this approach, we train a source model that is skillful on one task and reuse it for another task after tuning it appropriately.
Pretrained Model Approach: In this approach, we take one of the widely available models such as ULMFiT or BERT, which has been pretrained on a generic objective such as masked language modeling and next sentence prediction (as in the case of BERT), and fine-tune it for the task at hand (see the sketch after this list).
Feature Extraction Approach: In this approach, also known as representation learning, we use deep learning to automatically discover a good representation of the input by identifying its most important features; the learned representations (for example, embeddings from a pretrained network) are then fed to a downstream model for the task at hand.
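As a rough illustration of the last two approaches, the sketch below uses the Hugging Face Transformers library, with bert-base-uncased as an arbitrary example checkpoint: it loads a pretrained BERT once as a fine-tunable classifier (the pretrained model approach) and once as a frozen feature extractor (the feature extraction approach):

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Transfer learning saves training time.", return_tensors="pt")

# Pretrained model approach: reuse BERT, add a classification head,
# then fine-tune the whole model (or just the head) on the target task.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
logits = classifier(**inputs).logits  # shape: (1, 2)

# Feature extraction approach: keep BERT frozen and use its hidden states
# as input features for a separate downstream model.
encoder = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
```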
The last two years have seen an increase in the number of architectures and models that use transfer learning for processing natural language. Many of these follow the pretrained model approach, since it is relatively new and obtains competitive results without requiring much computation or time for fine-tuning. It is often used in data science challenges and is extensively researched in academia. Among these models, the following are essential to understand for their contribution to NLP: ULMFiT, BERT, and BERT-based models.
ULMFiT, introduced by Howard and Ruder in 2018, is a robust inductive transfer learning technique that can be applied to a wide range of NLP tasks. It uses a three-layer AWD-LSTM architecture pretrained on a large general corpus, which can then be fine-tuned for a specific downstream task.
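A minimal fastai sketch of ULMFiT-style fine-tuning might look as follows; this is only an outline, since the full ULMFiT recipe also fine-tunes the language model on the target corpus first and uses discriminative learning rates with gradual unfreezing:

```python
from fastai.text.all import *

# Download the IMDb movie-review dataset and build text dataloaders.
dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')

# text_classifier_learner reuses an AWD-LSTM encoder pretrained on a large
# general corpus (WikiText-103) and adds a classification head on top.
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)

# Fine-tune the pretrained encoder and the new head on the downstream task.
learn.fine_tune(4, 1e-2)
```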
BERT, introduced by Devlin et al. in 2018, focuses on learning contextualized word representations; fine-tuning it is relatively inexpensive (typically a few hours on a single GPU) and yields highly competitive results on most NLP tasks, including question answering, text classification, and natural language inference.
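To see what "contextualized" means in practice, the short sketch below (again using the Transformers library, with bert-base-uncased as an assumed example checkpoint) extracts the vector for the word "bank" in two different sentences; the two vectors differ because the surrounding context differs:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["He sat by the river bank.", "She deposited cash at the bank."]
with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        bank_vec = hidden[tokens.index("bank")]
        # The first few dimensions already differ across the two contexts.
        print(sentence, bank_vec[:3])
```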
Building on BERT, recent work shows that optimizations can not only reduce the time and memory cost of pretraining (DistilBERT, ALBERT) but also improve model performance (RoBERTa, ALBERT). It is therefore worth exploring models like RoBERTa, ALBERT, and DistilBERT for most NLP-related tasks.
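Because these variants expose the same interface in the Transformers library, swapping between them is mostly a matter of changing the checkpoint name, as the sketch below suggests (the checkpoint names are the standard ones published on the Hugging Face Hub):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The Auto* classes let us swap BERT for one of its optimized variants.
for checkpoint in ["roberta-base",               # more robust pretraining recipe
                   "albert-base-v2",             # parameter sharing -> smaller model
                   "distilbert-base-uncased"]:   # distillation -> faster inference
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    params_m = sum(p.numel() for p in model.parameters()) // 1_000_000
    print(checkpoint, params_m, "M parameters")
```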
There are many advantages of transfer learning, such as:
Improved Model Performance: It improves model performance by capturing semantic and syntactic relations between words in a natural language.
Reduced Training Time: It reduces the time taken to train the model for a downstream task.
Reduced Training Data for Fine-Tuning: It reduces the amount of data needed to fine-tune the model for the target task.
Reduced Experimentation Time: The pre-trained architecture (often open-sourced by the NLP community) reduces the time required to perform experimentation.
Reduced Code Complexity: The wide variety of libraries that facilitate transfer learning, such as the Hugging Face Transformers library for BERT-based models in PyTorch, helps reduce the complexity of code in practice.
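For example, with a high-level pipeline from the Transformers library, applying a pretrained, already fine-tuned sentiment model reduces to a couple of lines (the exact model downloaded by the default pipeline may vary between library versions):

```python
from transformers import pipeline

# The "sentiment-analysis" pipeline downloads a default fine-tuned checkpoint
# and wraps tokenization, inference, and post-processing behind one call.
classifier = pipeline("sentiment-analysis")
print(classifier("Transfer learning drastically reduces how much code we write."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```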
Therefore, transfer learning in NLP becomes instrumental when we have insufficient data in a new domain but a large pool of data in another domain from which knowledge can be transferred to the target domain. With recent developments, the use of transfer learning in the NLP community has become ubiquitous, and it is thus an essential topic for data science enthusiasts to learn.